Outline 0 Introduction In this paper…… An overview of the purpose of your analysis including details or references necessary.

1 Overview 1.1 Business Problem 1.2 Business Value Proposition

2 Data Description ◦the source ◦the variables with definitions or a link to a codebook ◦the number of observations in your data set ◦detail on missingness ◦a glimpse of your data if possible (e.g. head and tail)

3 Data Preprocessing ◦Feature generation ◦Imputation ◦Cleaning or merging of categories ◦Outlier removal ◦Anything that changes your data from the original form

4.Your Final Analysis in small pieces with annotation 5.Graphs to visualize different steps in your analysis 6.Clear discussion of why you made analysis choices 7.References to papers or citations you used to make decisions about the analysis

1 Introduction

This article aims to provide a reproducible analytics workflow solving a hypothetical business problem using a predictive regression model on the UCI Wine data. A hypothetical business value proposition along with the full analytics solution will be provided. The important processes will also be documented.

1.1 Data Background

The datasets used is obtained from UCI Machine Learning Repository. The two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. For more details, consult: https://www.vinhoverde.pt/en/

1.2 Business Problem

The business background of the project is to provide an analytics solution to the sales team in assisting with predicting wine quality / expert rating using only its chemical feature. The sales team would like to estimate on the wine quality before being in an expert so that they can potentially reduce cost by only have the expert rate the wine that’s estimated to have higher quality.

1.3 Business Value Proposition

2 Data Overview

2.1 Source of Data

Paulo Cortez, University of Minho, Guimarães, Portugal, http://www3.dsi.uminho.pt/pcortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis, Viticulture Commission of the Vinho Verde Region(CVRVV), Porto, Portugal, 2009

Download red wine dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv

Download white wine dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv

2.2 Dataset Description

The datasets include 12 variables:

Input variables:

  • fixed.acidity
  • volatile.acidity
  • citric.acid
  • residual.sugar
  • chlorides
  • free.sulfur.dioxide
  • density
  • pH
  • sulphates
  • alcohol
  • quality

Output Variable:

  • wine.type (Score between 0 and 10)

For more information, read from this file: https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality.names

2.3 A Glimps of Data

There are 12 columns available in both datasets. We can see that both datasets have the same attributes with the same data type, therefore it is reasonable for them to be combined.

wineRed = read.csv("winequality-red.csv",sep = ";")
wineWhite = read.csv("winequality-white.csv",sep = ";")

RedWine

head(wineRed)

RedWine Structure

str(wineRed)
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

WhiteWine

head(wineWhite)

WhiteWine Structure

str(wineWhite)
## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

3 Data Preprocessing

Prepare the data by checking missing data, merging datasets, and creating functions necessary for later analysis.

Metadata<-function(df){
  library(DataExplorer)
  library(kableExtra)
  z<-introduce(df)
  z<-as.data.frame(t(z))
  colnames(z)<-c()
  knitr::kable(
    z,
    caption="Data Introduction"
  ) %>% kable_styling(bootstrap_options = c("striped", "hover"),
                      full_width = F,
                      font_size = 12,
                      position = "left")
  
}

3.1 Missing Data

Based on the output below, we can see that there are no missing data present in both red and white wine datasets.

library(tidyverse)
wineRed %>% is.na() %>% colSums()
##        fixed.acidity     volatile.acidity          citric.acid 
##                    0                    0                    0 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##                    0                    0                    0 
## total.sulfur.dioxide              density                   pH 
##                    0                    0                    0 
##            sulphates              alcohol              quality 
##                    0                    0                    0
wineWhite %>% is.na() %>% colSums()
##        fixed.acidity     volatile.acidity          citric.acid 
##                    0                    0                    0 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##                    0                    0                    0 
## total.sulfur.dioxide              density                   pH 
##                    0                    0                    0 
##            sulphates              alcohol              quality 
##                    0                    0                    0

3.2 Merging Datasets

The different wine types are transformed into dummies and stored in a categorical variable named wine.type.

# Create Dummies
wineRed$wine.type <- 1
wineWhite$wine.type <- 0

# Check Dimensions
dim(wineRed)
## [1] 1599   13
dim(wineWhite)
## [1] 4898   13
# Combine dataframes
wine = rbind(wineRed, wineWhite)

# Recode attribute as factor
wine$wine.type <- as.factor(wine$wine.type)

# Double check dimension
dim(wine)
## [1] 6497   13

MetaData For Red Wine

Data Introduction
rows 1599
columns 13
discrete_columns 0
continuous_columns 13
all_missing_columns 0
total_missing_values 0
complete_rows 1599
total_observations 20787
memory_usage 163576

MetaData For White Wine

Data Introduction
rows 4898
columns 13
discrete_columns 0
continuous_columns 13
all_missing_columns 0
total_missing_values 0
complete_rows 4898
total_observations 63674
memory_usage 493472

MetaData For Wine

Data Introduction
rows 6497
columns 13
discrete_columns 1
continuous_columns 12
all_missing_columns 0
total_missing_values 0
complete_rows 6497
total_observations 84461
memory_usage 627904

4 Explanatory Data Analysis

To explore the dataset, we can take a look at a scatterplot for the response. We will use histograms for numerical variables, bar chart for categorical variables. Some other summarized statistics were also shown down below.

4.1 Dataset Summary

From the scatterplot, we cannot see any obvious trend between any variables and response.

From the histogram, we can see that most numeric variables are right skewed and unimodel which makes sense as these variables can’t have negative values. PH and quality are both normally distributed. Total sulfur dioxide seems to be bimodel.

The barplot shows the distribution of red and white wine with more white wine in the data.

Head of Wine

head(wine)

Scatterplot

## Scatterplot `quality` with all other continuous features
plot_scatterplot(split_columns(wine)$continuous, by = "quality", sampled_rows = 1000L)

Histograms

library(DataExplorer)
plot_histogram(wine)

Bar Plot

library(DataExplorer)
plot_bar(wine)

Statistical Summary

summary(wine)
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2500   1st Qu.: 1.800  
##  Median : 7.000   Median :0.2900   Median :0.3100   Median : 3.000  
##  Mean   : 7.215   Mean   :0.3397   Mean   :0.3186   Mean   : 5.443  
##  3rd Qu.: 7.700   3rd Qu.:0.4000   3rd Qu.:0.3900   3rd Qu.: 8.100  
##  Max.   :15.900   Max.   :1.5800   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide    density      
##  Min.   :0.00900   Min.   :  1.00      Min.   :  6.0        Min.   :0.9871  
##  1st Qu.:0.03800   1st Qu.: 17.00      1st Qu.: 77.0        1st Qu.:0.9923  
##  Median :0.04700   Median : 29.00      Median :118.0        Median :0.9949  
##  Mean   :0.05603   Mean   : 30.53      Mean   :115.7        Mean   :0.9947  
##  3rd Qu.:0.06500   3rd Qu.: 41.00      3rd Qu.:156.0        3rd Qu.:0.9970  
##  Max.   :0.61100   Max.   :289.00      Max.   :440.0        Max.   :1.0390  
##        pH          sulphates         alcohol         quality      wine.type
##  Min.   :2.720   Min.   :0.2200   Min.   : 8.00   Min.   :3.000   0:4898   
##  1st Qu.:3.110   1st Qu.:0.4300   1st Qu.: 9.50   1st Qu.:5.000   1:1599   
##  Median :3.210   Median :0.5100   Median :10.30   Median :6.000            
##  Mean   :3.219   Mean   :0.5313   Mean   :10.49   Mean   :5.818            
##  3rd Qu.:3.320   3rd Qu.:0.6000   3rd Qu.:11.30   3rd Qu.:6.000            
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :9.000

Data Structure

str(wine)
## 'data.frame':    6497 obs. of  13 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ wine.type           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...

4.2 Correlation & Multicolinearity

4.2.1 Correlation Plot

library(DataExplorer)
plot_correlation(wine, type = "c")

4.2.2 Strong Correlations

Based on the correlation plot above, some of the predictors have strong correlations:

  • Between density and alcohol (-0.69)
  • Between free.sulfur.dioxide and total.sulfur.dioxide (0.72)

These strongly correlated variable could potentially lead to multicolinearity.

5 Modeling

5.1 Spliting Data

set.seed(13)
trainIndex = sample(1:nrow(wine), size = round(0.75*nrow(wine)), replace=FALSE)
train<-wine[trainIndex, ]
valid<-wine[-trainIndex, ]
nrow(train)
## [1] 4873
nrow(valid)
## [1] 1624

5.2 Stepwise Model

library(MASS)
# Fit the full model 
full <- lm(quality ~., data = train)
# Stepwise regression model
step <- stepAIC(full, direction = "both", trace = FALSE)
summary(step)
## 
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + residual.sugar + 
##     free.sulfur.dioxide + total.sulfur.dioxide + density + pH + 
##     sulphates + alcohol + wine.type, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5440 -0.4690 -0.0480  0.4567  3.0005 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.178e+02  1.798e+01   6.553 6.23e-11 ***
## fixed.acidity         1.027e-01  1.895e-02   5.419 6.28e-08 ***
## volatile.acidity     -1.530e+00  8.799e-02 -17.392  < 2e-16 ***
## residual.sugar        6.846e-02  7.206e-03   9.501  < 2e-16 ***
## free.sulfur.dioxide   3.898e-03  8.880e-04   4.390 1.16e-05 ***
## total.sulfur.dioxide -1.261e-03  3.756e-04  -3.356 0.000796 ***
## density              -1.179e+02  1.825e+01  -6.457 1.17e-10 ***
## pH                    6.013e-01  1.066e-01   5.642 1.78e-08 ***
## sulphates             6.918e-01  8.790e-02   7.871 4.32e-15 ***
## alcohol               2.169e-01  2.290e-02   9.469  < 2e-16 ***
## wine.type1            3.660e-01  6.754e-02   5.418 6.30e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7342 on 4862 degrees of freedom
## Multiple R-squared:  0.2991, Adjusted R-squared:  0.2977 
## F-statistic: 207.5 on 10 and 4862 DF,  p-value: < 2.2e-16

6 Results and Performance

6.1 Model Interpretation

6.2 Prediciton using Validaiton data

p.valid<-predict(step, newdata=valid)
head(p.valid)
##        5        7       13       15       17       20 
## 4.928785 5.066270 5.376766 5.228540 6.034067 5.690349

6.3 Model Evaluation

library(caret)
RMSE(p.valid, valid$quality)
## [1] 0.7319179
R2(p.valid, valid$quality)
## [1] 0.2857878

7 Conclusion

7.1 One number

pressure
paste("The mean pressure is:", round(mean(pressure$pressure),3), "mm")

[1] “The mean pressure is: 124.337 mm”

7.2 A table

library(knitr)
kable(head(pressure), format="pipe", digit=3)
temperature pressure
0 0.000
20 0.001
40 0.006
60 0.030
80 0.090
100 0.270
kable(tail(pressure), format="pipe", digit=3)
temperature pressure
14 260 96
15 280 157
16 300 247
17 320 376
18 340 558
19 360 806

8 Including Plots

You can also embed plots, for example:

Plotly

Map

For more details on organizing with tabset go here https://bookdown.org/yihui/rmarkdown-cookbook/html-tabs.html.

9 Equations

You can include both inline and offset equations.

9.1 Inline equations

You can include inline equations like \(y = nx + b\), you can also do more complicated inline equations such as \(\hat{y} = \hat{\beta} + \hat{\beta_1}x\)

9.2 Offset equations

10 References